The objective of this report is to determine which covarites that can be used to predict if a US county has a low or high crime rate (serious crimes per 1000 inhabitants). The dataset used to do this was county demographic information (CDI) for 440 of the most populous counites in the US 1990-1992. Each county records includes data on the 14 variables listed below in table . Counties with missing data has been removed from the dataset.
| Variable | Description |
|---|---|
| id | identification number, 1–440 |
| county | county name |
| state | state abbreviation |
| area | land area (square miles) |
| popul | estimated 1990 population |
| pop1834 | percent of 1990 CDI population aged 18–34 |
| pop65plus | percent of 1990 CDI population aged 65 years old or older |
| phys | number of professionally active nonfederal physicians during 1990 |
| beds | total number of beds, cribs and bassinets during 1990 |
| crimes | total number of serious crimes in 1990 |
| higrads | percent of adults (25 yrs old or older) who completed at least 12 years of school |
| bachelors | percent of adults (25 yrs old or older) with bachelor’s degree |
| poors | Percent of 1990 CDI population with income below poverty level |
| unemployed | percent of 1990 CDI labor force which is unemployed |
| percapitaincome | per capita income of 1990 CDI population (dollars) |
| totalincome | total personal income of 1990 CDI population (in millions of dollars) |
| region | Geographic region classification used by the U.S. Bureau of the Census, |
| including Northeast, Midwest, South and West |
In order to measure crime rate, another varible called crm1000 was added, descibing the number of serious crimes per 1000 inhabitants. Using crm1000, counties were divided into counties with high or non-high crime rate, where counties with crime rate higher than the median of crm1000 in the dataset were categorized as having a high crime rate. This crime status of the county was stored in another column called hircrm, which takes the value 1 if the county is a high crime county and 0 if it is a low crime county. In this paper, this binary varible will be used as the dependent varible. Similar to crime rate, a variable phys1000 was also added, measuring the number of physicians per 1000 inhabitants.
The first model considered has higrads as the sole covariate. As such, the model becomes:
In order to determine if there is a relationship between hicrm and higrads they are plotted against each other in figure . Because hicrm is a binary varible it is very difficult to determine if there is a relationship only by the pattern in the plot. In order to clarify this relationship, a kernel smoother was added to the plot, where a smooth line was attained with a bandwidth of 20. In addition, the fitted model along with its 95 % confidence interval are included.
Plot of against , including kernel smoothing and prediction of fitted model with 95 % confidence interval
As seen in , the kernel curve looks S-shaped, implying that a logistic model may be appropriate to describe the relationship. Further, the S-shape is “downward facing”, implying a negative \(\beta_{higrads}\), i.e. that the probability of a county being classified as a high crime decreases when the amount of higrads in the county increases. Another thing to note in figure is how few data points exists with higrads below 60, meaning that significance is low in this region. In addition, looking at points with higrads over 60, the kernel curve and fitted line only cover about 25% - 75%, implying that higrads may not predict hicrm well.
In order to study the significance of the model, the \(\beta\) values together with their 95 % confidence inteval are presented in table .
| Estimate | 2.5 % | 97.5 % | P-value | |
|---|---|---|---|---|
| \(\beta_0\) | 3.98 | 1.81 | 6.25 | 0.00044 |
| \(\beta_{higrads}\) | -0.05 | -0.08 | -0.02 | 0.00041 |
As seen in the table, all of the P-values are > 0.05, meaning that that higrads has a statistically significant effect on hicrm. This effect can be measured by looking at \(e^{\beta_{higrads}}\), showing that an increase of 1% in higrads decreases odds of hicrm by 5%, while an increase of 10% decreases odds of by 40.1%.
Using the higrads model: the probability, with 95 % confidence interval, of having a high crime rate in a county where the amount of higrads is 65 (percent), and where it is 85 (percent) is predicted. The result may be found in table .
| Higrads | Probability (%) | 2.5 % | 97.5 % |
|---|---|---|---|
| 65 | 65.6 | 55.9 | 74.2 |
| 85 | 40.6 | 34.1 | 47.6 |
In order to analyze model performance, the sensitivity and specificity of the model was calculated. The sensitivity of a model is the ratio of predicted positives to real positives in the dataset, while the specificity of a model is the ratio of predicted negatives to real negatives in the dataset. As such, the higher the value of the sensitivity and specificity, the better.
The sensitivity of the model was 55.5% and the specificity of the model was 57.3%, implying that the model does not clasify the crime rate of the counties rather successfully.
Next, a logistic model was adopted based on the region covariate. Since region is not continous, but categorial, it is modelled using “dummy variables” \(X_i\). In order to implement this effectively, one of the categories is chosen as a reference variable, and the effects of other categories are measured in comparison to it.
In order to determine this reference variable, a cross-tabulation of the data between region and hirm is studied, see table .
| Low crime | High crime | |
|---|---|---|
| Northeast | 82 | 21 |
| Midwest | 64 | 44 |
| South | 44 | 108 |
| West | 30 | 47 |
As a reference region, the one that has the largest number of counties in it’s smallest low/high category was chosen. As a tie-breaker, the other low/high category was used. This approach produces the lowest standard error, and therefore highest significance. As seen in table , the above given condition results in choosing South as reference region.
Using this reference region, the logistic model becomes \[\begin{equation} \ln{\frac{p_i}{1 - p_i}} = \beta_0 + \beta_{Northeast} \cdot X_{Northeast,i} + \beta_{Midwest} \cdot X_{Midwest,i} + \beta_{West} \cdot X_{West,i} \end{equation}\]Here, the \(\beta\) coefficients are measured relative to South, while \(\beta_0\) is log-odds coefficient for South.
The model was fit with the given data set, estimating \(\beta_i\), shown together with its 95 % confidence interval and P-value in table .
| Estimate | 2.5 % | 97.5 % | P-value | |
|---|---|---|---|---|
| \(\beta_0\) | 0.90 | 0.56 | 1.26 | 0.000 |
| \(\beta_{Northeast}\) | -2.26 | -2.87 | -1.68 | 0.000 |
| \(\beta_{Midwest}\) | -1.27 | -1.80 | -0.76 | 0.000 |
| \(\beta_{West}\) | -0.45 | -1.03 | 0.13 | 0.127 |
As may be seen in table , the P-values for \(\beta_{West}\) is not less than 0.05, indicating a lack of statistical significance in the difference between how South and West predicts hivrm.
Next, the odds-ratios for the different categories where determined. The odds-ratios measure the odds of a particular category in relation to the reference category. These may be calculated as \(OR_i = e^{\beta_i}\) and may be seen in table .
| OR | 2.5 % | 97.5 % | |
|---|---|---|---|
| Northeast | 0.10 | 0.06 | 0.19 |
| Midwest | 0.28 | 0.17 | 0.47 |
| West | 0.64 | 0.36 | 1.14 |
As seen in table , the odds-ratios are less than 1 for all categories but the reference region. This implies that the odds for all regions are lower compared to the reference region, i.e. that the probability of a high crime rate is lower in all regions compared to the reference region. This is consistent with table .
Using the fitted model, the probabilies of having a high crime rate, with confidence interval, for the different regions was determined, shown in table .
| Probability (%) | 2.5 % | 97.5 % | |
|---|---|---|---|
| Northeast | 20.4 | 12.6 | 28.2 |
| Midwest | 40.7 | 31.5 | 50.0 |
| South | 71.1 | 63.8 | 78.3 |
| West | 61.0 | 50.1 | 71.9 |
For the region model, the sensitivity was 70.5%, while the specificity was 66.4%.
Comparing the two models analyzed so far, the region model performs better measured on sensitivity and specificity, as seen in table
| Covariate | Sensitivity (%) | Specificity (%) |
|---|---|---|
| Higrads | 55.5 | 57.3 |
| Region | 70.5 | 66.4 |
Next a model that uses both higrads and region is analyzed, i.e. the following model:
In order to compare the models, metrics other than sensitivity and specificity may be studied. Some of these are AIC and BIC, which are penalized-likelihood criteria and Nagelkerke psuedo \(R^2\), which is a measurment that increses up to 1 the better the model fit is. As they are defined, AIC and BIC should be as low as possible for a model to be performant, while psuedo \(R^2\) should be as high as possible.
Comparison between the models in regards to AIC, BIC and Psuedo \(R^2\) are seen in figure , while comparison of sensitivity and specificity is seen in table .
Comparison of AIC and BIC and Nagelkerke psuedo \(R^2\) for the different models
| Covariate | Sensitivity (%) | Specificity (%) |
|---|---|---|
| Higrads | 55.5 | 57.3 |
| Region | 70.5 | 66.4 |
| Both | 70.5 | 67.3 |
As seen in figure and table , the combined model with both the covariates performs the best on all the studied metrics.
Performance of the combined model can be analyzed by studying a QQ-plot (see figure ) the squared standardized Pearson residuals and the standardized deviance residuals against the linear predictor \(x^{\beta}\) (see figure ). As well as the Cook’s distance against the linear predictor, and against higrads and against region (see figure ).
QQ-plot for the combined model
The QQ-plot seem to follow a bimodal distribution.
Squared standardized Pearson residuals as well as standardized deviance residuals for the combined model, against the linear predictor \(x^{\beta}\)
Squared standardized Pearson residuals as well as standardized deviance residuals for the combined model, against the linear predictor \(x^{\beta}\)
For a standardized pearson residual to be considered suspiciously large, \(|r_i| > |\lambda_{\alpha/2}| \approx 2\). This is true for 8 of the standadized pearson residuals, which can be seen in figures XX and XX, where the standardized pearson residuals are plotted against the linear predictor. For a deviance residual to be considered too large $|r_i| > 2 $. This is never the case which is verified by figure XX where the standardized devience residuals are plotted against the linear predictor.
In order to measure how the \(\beta\)-estimates were influenced by indiviual observations Cook’s distance for logistic regression was calcultaed and plotted.
Cook’s distance, for the combined model, against linear predictor, region as well as higrads
As can be seen in diagrams XX, XX, XX the vast majoriy of the the observations is below the horizontal line. The amount of counties with a high Cook’s distance is limited to 3-4. The data entry with the largest Cook’s distance is found in the South region while the rest of the counties with a high Cook’s distance is found in the West region. The Cook’s distance daigrams give no concern to why the combined model should be amiss
As a forth model, interaction terms are also considered, building in that the effect of higrads may be different in different regions, where the log-odds in the model includes interaction terms such as \(\beta_{Northeast * higrads} \cdot X_{higrads,i}\).
The performance of the interaction model compared to the combined model may be analyzed by the likelihood test. This test is similar to a partial F-test, but adapted for logistical regression, using likelihoods since sums of squares are not applicable. The likelihood test results in the interaction model being significantly better than the combined model, with a P-value of 0.019.
In addition, the AIC, BIC, Nagelkerke, sensitivity and specificity is compared to the combined model, in table . Performance of the interaction model can be analyzed by studying a QQ-plot (see figure ) the squared standardized Pearson residuals and the standardized deviance residuals against the linear predictor \(x^{\beta}\) (see figure ). As well as the Cook’s distance against the linear predictor, and against higrads and against region (see figure ).
| Covariate | AIC | BIC | Sensitivity (%) | Specificity (%) | Pseudo R2 |
|---|---|---|---|---|---|
| Combined model | 537 | 558 | 70 | 67 | 0.23 |
| Interaction model | 533 | 566 | 72 | 68 | 0.25 |
QQ-plot for the integration model
Squared standardized Pearson residuals as well as standardized deviance residuals for the interaction model, against the linear predictor \(x^{\beta}\)
The interaction has more Person residuals that can be considered suspisiously large than the combined model. Furthermore it has standardized devience residuals that are too large. This questions the of the credability of the interaction model.
Cook’s distance, for the interaction model, against linear predictor, region as well as higrads
TODO ändra i analysen!
Both the interaction and the combined model perform better on some metrics, while performing worse on other. The interaction model has a lower AIC value but at the same time higher BIC-value. This is too extraordinary since BIC penalizes larger more complex models more than the AIC value tends to do. The sensitivity, specificity and Pseudo \(R^2\) values are worse for the Combined model. The interaction model performs considerably worse than the combined model on the Cook’s distance acount. The interaction model have far more outliers and outliers with a much higher Cook’s distance.
As aforementioned, the interaction model has some credibility issues with its residuals. Moreover the interaction model has far more outliers that as well have markedly larger Cook’s distance than the combined model. At the same time the interaction model outperforms the combined model on likelyhood ratio test, AIC, Sensitivity, Specificity and Pseudo R2. The outliers in the interaction model might therefor not effect the model to much. In the end, the interaction model was favoured over the combined model.
Next, an attempt to fit an optimal model to predict high crime rates is made, using the previous covariates, as well as poors and pshys1000. Finding the optimal model, interaction terms were disregarded.
Models of increasingly complexity, adding more covariates are compared to each other on the used metrics, i.e. AIC, BIC, Pseudo \(R^2\), sensitivity and specificity. In addition, the result of automatic selection using R step function is studied.
AIC, BIC and Pseudo \(R^2\) of the studied model are shown in figure . In addition, table includes sensitivity and specificity for the different models.
Comparison of AIC and BIC and Nagelkerke psuedo \(R^2\) for the different models. Key: H = , R = , Po = , Phy =
| Model | AIC | BIC | Sensitivity (%) | Specificity (%) | Pseudo R2 |
|---|---|---|---|---|---|
| H | 601 | 609 | 55 | 57 | 0.04 |
| H + R | 537 | 558 | 70 | 67 | 0.23 |
| H + R + Po | 494 | 518 | 73 | 75 | 0.34 |
| H + R + Po + Phy | 481 | 509 | 74 | 75 | 0.37 |
| R + Po + Phy | 480 | 504 | 75 | 74 | 0.37 |
| H + R + Phy | 508 | 532 | 72 | 72 | 0.30 |
The results in and show that the region + poors + phys1000 model performs best on most of the metrics. This result is also consistent with the step algorithm results. As such, this model is considered the for this problem.
Performance of the optimal is then analyzed by studying a QQ-plot (see figure ) the squared standardized Pearson residuals and the standardized deviance residuals against the linear predictor \(x^{\beta}\) (see figure ). As well as the Cook’s distance against the linear predictor, and against higrads and against region (see figure ).
QQ-plot for the optimal model
The QQ plot of the optimal model seem to follow a normal distribution skewed to the left.
Squared standardized Pearson residuals as well as standardized deviance residuals for the optimal model, against the linear predictor \(x^{\beta}\)
There is one particular outlier that seem to generate an exceptionaly large Pearson residual. Furthermore there are a few Pearson Resdiuals that have a suspisiously larege value. As for the devience residuals, some of them are too large as well. Disregarding the most extreme outlier, the residuals are not far worse off than the residuals of the interaction model. The most extreme outlier is the Olmsted county. In the next section the potential influence of individual observations will be adressed, by plotting the Cook’s disatance.
Cook’s distance, for the optimal model, against linear predictor, region as well as higrads
The outlier is Olmsted, see figure for a plot of Cook’s distance excluding this point. This is the aforementioned data entry that resulted in the extreme residuals. Table compares the performance of these models.
Cook’s distance, for the optimal model, against linear predictor, region as well as higrads, excluding outlier Olmsted
Removing the problematic data entry from the CDI data frame, and refitting the optimal model resulted in a substantial improvement in the resdiual plots, see figures XX and XX, and in the the Cook’s distance plots, see figures XX and XX. Comparing the Cook’s distsance plots and resiudal plots without the problematic data entry generated relatively simlar plots as the figures XX and XX. The main difference is the largest two outliers in the figures XX and XX.
| Model | AIC | BIC | Sensitivity (%) | Specificity (%) | Pseudo R2 |
|---|---|---|---|---|---|
| Optimal | 480 | 504 | 75 | 74 | 0.37 |
| Optimal without outlier | 466 | 491 | 74 | 75 | 0.39 |
The model with the outlier removed performs better.
In order to first get a view on the different covariates and how they relate to each other, they are plotted against eachother in figure .
#> `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
#> `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
#> `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
#> `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Plot of covariates against eachother
The optimal model includes the previously studied covariate region, but discards the higrads covariate. In addition, it includes the new poors and phys1000 covariates. One explaination why higrads is not used in the optimal may be seen in figure , where there is a high correlation between higrads and poors. With incresed multicolinearity the standrard error of the coefficents increase. In worst case this can cause some varibles to not be significant. Problems such as these are not to unlikley to happen when ivolving both the region and higrads covariates, which have a correlation of -0.692 bewteen them. This is probably one of the reasons why the optimal model performs better than the interaction model in the previous section, which had both the higrads and region as covarites.
Plot of against , together with linear regression line, with 95 % confidence and prediction intervals
Looking at how well poors predicts hicrm may be seen in figure . Here it may be seen that poors follow a more distinct S-shape, and that poors seems to split the dataset more distinctly between high and non-high crime rate, as it varies from \(\approx 15% - \approx 80%\), rather than the low separation discussed previously. As such,it seems that higrads and poors are highly correlated, but that poors better predict hicrm and is therefore better left in the model.
Plot of against , including kernel smoothing and prediction of fitted model with 95 % confidence interval
Regarding phys1000, it appears in that it does not have an as clear relationship to the other covariates and therefore provides more information to the model. Looking at how well phys1000 predicts hicrm, seen in figure , it seems to follow an approximate S-shape and therefor contributes to the model.
Plot of against , including kernel smoothing and prediction of fitted model with 95 % confidence interval
To Summarize, the optimal model ignores the higrads covariate and use region, poors and phys1000 as covariates. Doing so the